training pattern
Practical Confidence and Prediction Intervals
We propose a new method to compute prediction intervals. Especially for small data sets, the width of a prediction interval depends not only on the variance of the target distribution, but also on the accuracy of our estimator of the mean of the target, i.e., on the width of the confidence interval. The confidence interval follows from the variation in an ensemble of neural networks, each of them trained and stopped on bootstrap replicates of the original data set. A second improvement is the use of the residuals on validation patterns instead of on training patterns for estimation of the variance of the target distribution. As illustrated on a synthetic example, our method is better than existing methods with regard to extrapolation and interpolation in data regimes with a limited amount of data, and yields prediction intervals whose actual confidence levels are closer to the desired confidence levels.
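The idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a least-squares line stands in for each neural network, early stopping is omitted, and the noise variance is simply the pooled squared residual on the patterns left out of each bootstrap replicate. All function and parameter names (`fit_line`, `bootstrap_intervals`, `n_members`, `z`) are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate replicate: every sampled x is identical
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def bootstrap_intervals(xs, ys, x_new, n_members=200, z=1.96, seed=0):
    """Return (mean, confidence interval, prediction interval) at x_new."""
    rng = random.Random(seed)
    n = len(xs)
    preds, sq_resid = [], []
    for _ in range(n_members):
        # Bootstrap replicate: sample n pattern indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a + b * x_new)
        # Patterns not drawn into the replicate act as validation patterns.
        chosen = set(idx)
        for i in range(n):
            if i not in chosen:
                sq_resid.append((ys[i] - (a + b * xs[i])) ** 2)
    mean = statistics.fmean(preds)
    var_conf = statistics.variance(preds)   # spread of the ensemble
    var_noise = statistics.fmean(sq_resid)  # simplified target-noise estimate
    half_conf = z * var_conf ** 0.5
    half_pred = z * (var_conf + var_noise) ** 0.5
    return (mean,
            (mean - half_conf, mean + half_conf),
            (mean - half_pred, mean + half_pred))

# Synthetic data: y = 2x + 1 plus Gaussian noise (assumed for illustration).
data_rng = random.Random(1)
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 0.3) for x in xs]
mean, conf, pred = bootstrap_intervals(xs, ys, x_new=1.0)
```

The prediction interval is always at least as wide as the confidence interval, because it adds the noise variance to the ensemble variance before taking the square root.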
On the Over-Memorization During Natural, Robust and Catastrophic Overfitting
Lin, Runqi, Yu, Chaojian, Han, Bo, Liu, Tongliang
Overfitting negatively impacts the generalization ability of deep neural networks (DNNs) in both natural and adversarial training. Existing methods struggle to consistently address different types of overfitting, typically designing strategies that focus separately on either natural or adversarial patterns. In this work, we adopt a unified perspective by solely focusing on natural patterns to explore different types of overfitting. Specifically, we examine the memorization effect in DNNs and reveal a shared behaviour termed over-memorization, which impairs their generalization capacity. This behaviour manifests as DNNs suddenly becoming high-confidence in predicting certain training patterns and retaining a persistent memory for them. Furthermore, when DNNs over-memorize an adversarial pattern, they tend to simultaneously exhibit high-confidence prediction for the corresponding natural pattern. These findings motivate us to holistically mitigate different types of overfitting by hindering the DNNs from over-memorizing natural patterns. To this end, we propose a general framework, Distraction Over-Memorization (DOM), which explicitly prevents over-memorization by either removing or augmenting the high-confidence natural patterns. Extensive experiments demonstrate the effectiveness of our proposed method in mitigating overfitting across various training paradigms.
Network Generality, Training Required, and Precision Required
We show how to estimate (1) the number of functions that can be implemented by a particular network architecture, (2) how much analog precision is needed in the connections in the network, and (3) the number of training examples the network must see before it can be expected to form reliable generalizations. Consider the following objectives: First, the network should be very powerful and versatile, i.e., it should implement any function (truth table) you like, and secondly, it should learn easily, forming meaningful generalizations from a small number of training examples. Well, it is information-theoretically impossible to create such a network. We will present here a simplified argument; a more complete and sophisticated version can be found in Denker et al. (1987). It is customary to regard learning as a dynamical process: adjusting the weights (etc.) in a single network.
A competitive modular connectionist architecture
We describe a multi-network, or modular, connectionist architecture that captures the fact that many tasks have structure at a level of granularity intermediate to that assumed by local and global function approximation schemes. The main innovation of the architecture is that it combines associative and competitive learning in order to learn task decompositions. A task decomposition is discovered by forcing the networks comprising the architecture to compete to learn the training patterns. As a result of the competition, different networks learn different training patterns and, thus, learn to partition the input space. The performance of the architecture on a "what" and "where" vision task and on a multi-payload robotics task is presented.
For Valid Generalization the Size of the Weights is More Important than the Size of the Network
This paper shows that if a large neural network is used for a pattern classification problem, and the learning algorithm finds a network with small weights that has small squared error on the training patterns, then the generalization performance depends on the size of the weights rather than the number of weights. More specifically, consider an $l$-layer feed-forward network of sigmoid units, in which the sum of the magnitudes of the weights associated with each unit is bounded by $A$. The misclassification probability converges to an error estimate (that is closely related to squared error on the training set) at rate $O\big((cA)^{l(l+1)/2}\sqrt{(\log n)/m}\big)$ ignoring log factors, where $m$ is the number of training patterns, $n$ is the input dimension, and $c$ is a constant. This may explain the generalization performance of neural networks, particularly when the number of training examples is considerably smaller than the number of weights. It also supports heuristics (such as weight decay and early stopping) that attempt to keep the weights small during training.
This Is How I Utilize AI To Create One-of-A-kind Fairytales
A young girl once enjoyed reading fairytales. Every night before bed, she would read them with her mother. However, as she grew older, she realized that the stories didn't always make sense. They were extremely predictable, and the personalities didn't alter much. "We could do better than this," she thought to herself.
Non-Parametric Model
Non-parametric machine learning algorithms try to make assumptions about the data given the patterns observed from similar instances. For example, a popular non-parametric machine learning algorithm is the K-Nearest Neighbor algorithm that looks at similar training patterns for new instances. The only assumption it makes about the data set is that the training patterns that are the most similar are most likely to have a similar result. While non-parametric machine learning algorithms are often slower and require large amounts of data, they are rather flexible as they minimize the assumptions they make about the data.
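As a rough illustration of the K-Nearest Neighbor idea described above, the sketch below classifies a new instance by majority vote among its closest training patterns. The function name `knn_predict`, the Euclidean distance, and the toy data are assumptions made for the example, not part of the source.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training patterns.

    `train` is a list of (feature_tuple, label) pairs.
    """
    # Sort training patterns by Euclidean distance to the query, keep the k closest.
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy training patterns: two well-separated clusters labelled "a" and "b".
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((4.0, 4.2), "b"), ((4.1, 3.9), "b"), ((3.8, 4.0), "b")]

label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (4.0, 4.0))
```

No parameters are fitted in advance; every prediction scans the stored training patterns, which is one reason such methods can be slow on large data sets.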
Training Pattern
With supervised training, the desired inputs and outputs are provided by the trainer. The network then classifies the inputs and compares the resultant outputs against the benchmark outputs. Any errors are back-propagated throughout the system, which forces the network to adjust the various parameter weights. This continuous tweaking process repeats over and over, giving the "deep learning" name to the network.
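The supervised loop described above (present an input, compare the output with the desired output, back-propagate the error, adjust the weights) can be sketched for a tiny network. This is a minimal illustration under assumptions: one hidden layer of sigmoid units, squared error, per-pattern gradient descent, and logical OR as the toy training patterns. The names `forward` and `train` and all hyperparameters are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w_h, b_h, w_o, b_o, x):
    """Return (hidden activations, network output) for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_h, b_h)]
    out = sigmoid(sum(w * h for w, h in zip(w_o, hidden)) + b_o)
    return hidden, out

def train(patterns, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train on (input, target) patterns; return the summed loss per epoch."""
    rng = random.Random(seed)
    n_in = len(patterns[0][0])
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [0.0] * n_hidden
    w_o = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_o = 0.0
    losses = []
    for _ in range(epochs):
        loss = 0.0
        for x, target in patterns:
            hidden, out = forward(w_h, b_h, w_o, b_o, x)
            err = out - target                 # compare with desired output
            loss += 0.5 * err ** 2
            # Back-propagate: output delta first, then hidden deltas
            # (computed before any weight is changed).
            delta_o = err * out * (1.0 - out)
            delta_h = [delta_o * w_o[j] * hidden[j] * (1.0 - hidden[j])
                       for j in range(n_hidden)]
            # Adjust the weights against the gradient.
            for j in range(n_hidden):
                w_o[j] -= lr * delta_o * hidden[j]
                b_h[j] -= lr * delta_h[j]
                for i in range(n_in):
                    w_h[j][i] -= lr * delta_h[j] * x[i]
            b_o -= lr * delta_o
        losses.append(loss)
    return losses

# Training patterns for logical OR: (inputs, desired output).
or_patterns = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
               ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
losses = train(or_patterns)
```

Comparing `losses[0]` with `losses[-1]` shows the effect of the repeated adjustments: the summed error over the training patterns shrinks as the epochs go by.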
92c/MFlops/s, Ultra-Large-Scale Neural-Network Training on a PIII Cluster
Aberdeen, Douglas, Baxter, Jonathan, Edwards, Robert
Artificial neural networks with millions of adjustable parameters and a similar number of training examples are a potential solution for difficult, large-scale pattern recognition problems in areas such as speech and face recognition, classification of large volumes of web data, and finance. The bottleneck is that neural network training involves iterative gradient descent and is extremely computationally intensive. In this paper we present a technique for distributed training of Ultra Large Scale Neural Networks (ULSNN) on Bunyip, a Linux-based cluster of 196 Pentium III processors. To illustrate ULSNN training we describe an experiment in which a neural network with 1.73 million adjustable parameters was trained to recognize machine-printed Japanese characters from a database containing 9 million training patterns. The training runs with an average performance of 163.3 GFlops/s (single precision). With a machine cost of \$150,913, this yields a price/performance ratio of 92.4c/MFlops/s (single precision). For comparison purposes, training using double precision and the ATLAS DGEMM produces a sustained performance of 70 GFlops/s or \$2.16 / MFlop/s (double precision).
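The price/performance figures in this abstract follow from dividing the machine cost by the sustained throughput. The short check below reproduces the single-precision ratio and works backwards from the quoted \$2.16 per MFlop/s to the double-precision throughput it implies; the function name `dollars_per_mflops` is hypothetical.

```python
machine_cost_usd = 150_913      # machine cost quoted in the abstract
single_gflops = 163.3           # single-precision average performance

def dollars_per_mflops(cost_usd, gflops):
    """Price/performance in dollars per MFlop/s (1 GFlop/s = 1000 MFlop/s)."""
    return cost_usd / (gflops * 1000.0)

# Single precision: about 0.924 dollars, i.e. about 92.4 cents per MFlop/s.
single_ratio = dollars_per_mflops(machine_cost_usd, single_gflops)

# Double precision: throughput implied by the quoted 2.16 dollars per MFlop/s.
implied_double_gflops = machine_cost_usd / 2.16 / 1000.0
```

The implied double-precision throughput comes out at roughly 70 GFlops/s, which is the scale at which the quoted \$2.16 per MFlop/s is consistent with the machine cost.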
Towards Sampling from Nondirected Probabilistic Graphical models using a D-Wave Quantum Annealer
Koshka, Yaroslav, Novotny, M. A.
A D-Wave quantum annealer (QA) having a 2048 qubit lattice, with no missing qubits and couplings, allowed embedding of a complete graph of a Restricted Boltzmann Machine (RBM). A handwritten digit OptDigits data set having 8x7 pixels of visible units was used to train the RBM using a classical Contrastive Divergence. Embedding of the classically-trained RBM into the D-Wave lattice was used to demonstrate that the QA offers a high-efficiency alternative to the classical Markov Chain Monte Carlo (MCMC) for reconstructing missing labels of the test images as well as a generative model. At any training iteration, the D-Wave-based classification had classification error more than two times lower than MCMC. The main goal of this study was to investigate the quality of the sample from the RBM model distribution and its comparison to a classical MCMC sample. For the OptDigits dataset, the states in the D-Wave sample belonged to about two times more local valleys compared to the MCMC sample. All the lowest-energy (the highest joint probability) local minima in the MCMC sample were also found by the D-Wave. The D-Wave missed many of the higher-energy local valleys, while finding many "new" local valleys consistently missed by the MCMC. It was established that the "new" local valleys that the D-Wave finds are important for the model distribution in terms of the energy of the corresponding local minima, the width of the local valleys, and the height of the escape barrier.